Conversation
Here are the results for a run on my dev machine (4090).
It's the first time I'm looking at this code, and my second question was: what is the purpose of the benchmarks? Cursor (GPT-5.4 Extra High Fast) offered some answers. I asked it to generate a "Motivation" section based on what it found, see below. I think it'd be a great addition to the README.

Motivation

These benchmarks are intended to measure the latency overhead of calling CUDA Driver APIs through the Python bindings. The main goal is to help answer questions such as:

The paired C++ benchmarks are included to provide a lower-level reference point for the same operations. Comparing Python and C++ results helps estimate the additional cost introduced by the Python-to-C boundary and by binding-specific marshalling work. These benchmarks are not intended to measure overall GPU performance, kernel throughput, or end-to-end application speed. Most of the benchmarked operations are deliberately tiny, so the reported numbers are best interpreted as binding/API-call latency measurements and regression signals, rather than as predictions of full application performance. Because the benchmarked operations are so small, methodology matters a lot: the most useful comparisons are between Python and C++ benchmarks that perform work as close to identical as possible and are run under similar conditions.
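Since the quantities being measured are sub-microsecond per-call latencies, the shape of the harness (warm-up, batching, aggregation) dominates the noise floor. A minimal stdlib-only sketch of that measurement pattern — not the actual benchmark code from this PR, and using a trivial no-op as a stand-in for a driver-API binding call — could look like:

```python
import statistics
import time

def bench_call_latency_ns(fn, *, warmup=1_000, inner=10_000, samples=20):
    """Estimate per-call latency of ``fn`` in nanoseconds.

    Warm-up iterations let one-time effects settle; each sample times a
    batch of ``inner`` calls so timer granularity is amortized across
    many calls; the median across samples damps scheduler noise.
    """
    for _ in range(warmup):  # warm-up phase, results discarded
        fn()
    per_call_ns = []
    for _ in range(samples):
        t0 = time.perf_counter_ns()
        for _ in range(inner):
            fn()
        t1 = time.perf_counter_ns()
        per_call_ns.append((t1 - t0) / inner)
    return statistics.median(per_call_ns)

# Stand-in workload: a no-op Python call, playing the role of a tiny
# CUDA Driver API call made through the bindings.
noop_ns = bench_call_latency_ns(lambda: None)
print(f"no-op call: {noop_ns:.1f} ns/call")
```

The same skeleton applies on the C++ side; keeping the warm-up and batching policy aligned between the two harnesses is exactly the "similar conditions" requirement the Motivation section calls out.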
My first question (to Cursor) when reviewing this PR was:
After it gave me the response below, I started thinking about the motivation, with the result in the previous comment. In light of that, the findings below still seem relevant, but I'd need to look closer to be certain which of the "not clean apples-to-apples" aspects it found are actually meaningful. I hope they are at least a good starting point for figuring it out together, so I'm copy-pasting them below.

Findings
What Looks Reasonably Matched
Bottom Line
Note
Yeah, the motivation is correct: it's just the latency/overhead of the Python layer, not throughput. I'll add it to the README. And yeah, on the review, I think I'd agree with most of the items marked as high (and I will try to match those closer), but for the other ones I think it's almost impossible to do a full apples-to-apples comparison, so I'm not sure I'd change much there. I'll leave it up to you all to make that call :D
Ok, I added a couple of

About the second comment, I think it's "ok". The C++ one doesn't match pyperf fully, since pyperf does a bit fancier stats for the warm-up and the number of measurements while the C++ side uses a fixed count, but I don't think it should affect much, especially for measuring host latency?
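To illustrate why the mismatch between a fixed-count C++ loop and pyperf's adaptive approach tends not to matter for stable host-latency workloads: pyperf calibrates how many inner iterations each sample runs so that a sample lasts long enough to swamp timer granularity, while a fixed-count harness just picks the batch size up front. A simplified stdlib-only sketch of the two strategies (this is a rough caricature of pyperf's calibration, not its actual algorithm) on a stand-in no-op workload:

```python
import time

def fixed_count_ns(fn, inner=50_000):
    """Fixed-count timing, like a simple C++ harness: the batch size
    is chosen up front and never adjusted."""
    t0 = time.perf_counter_ns()
    for _ in range(inner):
        fn()
    return (time.perf_counter_ns() - t0) / inner

def calibrated_ns(fn, min_sample_ns=1_000_000):
    """Calibrated timing: double the batch size until one sample lasts
    at least ``min_sample_ns`` (here a 1 ms floor), so timer resolution
    is negligible relative to the sample. Roughly what pyperf's
    calibration phase is for, heavily simplified."""
    inner = 1
    while True:
        t0 = time.perf_counter_ns()
        for _ in range(inner):
            fn()
        elapsed = time.perf_counter_ns() - t0
        if elapsed >= min_sample_ns:
            return elapsed / inner
        inner *= 2

workload = lambda: None  # stand-in for a tiny driver-API call
a = fixed_count_ns(workload)
b = calibrated_ns(workload)
print(f"fixed: {a:.1f} ns/call, calibrated: {b:.1f} ns/call")
```

For a steady host-side operation, both estimators converge on essentially the same per-call number; the calibration mostly buys robustness when the cost of one call is unknown in advance.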
Description
closes #1580
Follow-up to #1580
Adds a couple more benchmarks and fixes a couple of issues with the pyperf JSON handling.
Checklist